Questions

How can I reduce my data set in rows or select only a certain set of columns?

Objectives

To be able to reduce the number of rows in a data set

To be able to select the columns wanted of a data set

In many cases, we are working with data sets that contain more data than we need, or we want to inspect certain parts of the data set before we continue. Subsetting data sets can be challenging in base R, because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

The {dplyr} package

The {dplyr} package provides a number of very useful functions for manipulating data sets in a way that will reduce the probability of making errors, and even save you some typing time. As an added bonus, you might even find the {dplyr} grammar easier to read.

We’re going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.

  1. select() (covered in this session)
  2. filter() (covered in this session)
  3. arrange() (covered in this session)
  4. mutate() (covered in Day 2 session)
  5. group_by() (covered in Day 2 session)
  6. summarize() (covered in Day 3 session)

Selecting columns, part one

Let us first talk about selecting columns. In {dplyr}, the function name for selecting columns is select()! Most {tidyverse} function names for functions are inspired by English grammar, which will help us when we are writing our code.

We first need to make sure we have the tidyverse loaded and the penguins data set at hand.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
penguins <- palmerpenguins::penguins

To select data, we must first tell select which data set we are selecting from, and then give it our selection. Here, we are asking R to select() from the penguins data set the island, species and sex columns

select(penguins, island, species, sex)
## # A tibble: 344 x 3
##    island    species sex   
##    <fct>     <fct>   <fct> 
##  1 Torgersen Adelie  male  
##  2 Torgersen Adelie  female
##  3 Torgersen Adelie  female
##  4 Torgersen Adelie  <NA>  
##  5 Torgersen Adelie  female
##  6 Torgersen Adelie  male  
##  7 Torgersen Adelie  female
##  8 Torgersen Adelie  male  
##  9 Torgersen Adelie  <NA>  
## 10 Torgersen Adelie  <NA>  
## # ... with 334 more rows

When we use select() we don’t need to use quotations, we write in the names directly. We can also use the numeric indexes for the column, if we are 100% certain of the order of the columns:

select(penguins, 1:3, 6)
## # A tibble: 344 x 4
##    species island    bill_length_mm body_mass_g
##    <fct>   <fct>              <dbl>       <int>
##  1 Adelie  Torgersen           39.1        3750
##  2 Adelie  Torgersen           39.5        3800
##  3 Adelie  Torgersen           40.3        3250
##  4 Adelie  Torgersen           NA            NA
##  5 Adelie  Torgersen           36.7        3450
##  6 Adelie  Torgersen           39.3        3650
##  7 Adelie  Torgersen           38.9        3625
##  8 Adelie  Torgersen           39.2        4675
##  9 Adelie  Torgersen           34.1        3475
## 10 Adelie  Torgersen           42          4250
## # ... with 334 more rows

In some cases, we want to remove columns, and not necessarily state all columns we want to keep. Select also allows for this by adding a minus (-) sign in front of the column name you don’t want.

select(penguins, -bill_length_mm, -bill_depth_mm)
## # A tibble: 344 x 6
##    species island    flipper_length_mm body_mass_g sex     year
##    <fct>   <fct>                 <int>       <int> <fct>  <int>
##  1 Adelie  Torgersen               181        3750 male    2007
##  2 Adelie  Torgersen               186        3800 female  2007
##  3 Adelie  Torgersen               195        3250 female  2007
##  4 Adelie  Torgersen                NA          NA <NA>    2007
##  5 Adelie  Torgersen               193        3450 female  2007
##  6 Adelie  Torgersen               190        3650 male    2007
##  7 Adelie  Torgersen               181        3625 female  2007
##  8 Adelie  Torgersen               195        4675 male    2007
##  9 Adelie  Torgersen               193        3475 <NA>    2007
## 10 Adelie  Torgersen               190        4250 <NA>    2007
## # ... with 334 more rows

Selecting columns, part two

These selections are quite convenient and fast! But they can be even better.

For instance, what if we want to choose all the columns with millimeter measurements? That could be quite convenient, making sure the variables we are working with have the same measurement scale.

We could of course type them all out, but the penguins data set has names that make it even easier for us, using something called tidy-selectors.

Here, we use a tidy-selector ends_with(), can you guess what it does? yes, it looks for columns that end with the string you provide it, here "mm".

select(penguins, ends_with("mm"))
## # A tibble: 344 x 3
##    bill_length_mm bill_depth_mm flipper_length_mm
##             <dbl>         <dbl>             <int>
##  1           39.1          18.7               181
##  2           39.5          17.4               186
##  3           40.3          18                 195
##  4           NA            NA                  NA
##  5           36.7          19.3               193
##  6           39.3          20.6               190
##  7           38.9          17.8               181
##  8           39.2          19.6               195
##  9           34.1          18.1               193
## 10           42            20.2               190
## # ... with 334 more rows

So convenient! There are several other tidy-selectors you can choose, which you can find here, but often people resort to three specific ones:

  • ends_with() - column names ending with a character string
  • starts_with() - column names starting with a character string
  • contains() - column names containing a character string

If you are working with a well named data set, these functions should make your data selecting much simpler. And if you are making your own data, you can think of such convenient naming for your data, so your work can be easier for you and others.

Lets only pick the measurements of the bill, we are not so interested in the flipper. Then we might want to change to starts_with() in stead.

select(penguins, starts_with("bill"))
## # A tibble: 344 x 2
##    bill_length_mm bill_depth_mm
##             <dbl>         <dbl>
##  1           39.1          18.7
##  2           39.5          17.4
##  3           40.3          18  
##  4           NA            NA  
##  5           36.7          19.3
##  6           39.3          20.6
##  7           38.9          17.8
##  8           39.2          19.6
##  9           34.1          18.1
## 10           42            20.2
## # ... with 334 more rows

The tidy selector can be combined with each other and other selectors. So you can build exactly the data you want!

select(penguins, island, species, year, starts_with("bill"))
## # A tibble: 344 x 5
##    island    species  year bill_length_mm bill_depth_mm
##    <fct>     <fct>   <int>          <dbl>         <dbl>
##  1 Torgersen Adelie   2007           39.1          18.7
##  2 Torgersen Adelie   2007           39.5          17.4
##  3 Torgersen Adelie   2007           40.3          18  
##  4 Torgersen Adelie   2007           NA            NA  
##  5 Torgersen Adelie   2007           36.7          19.3
##  6 Torgersen Adelie   2007           39.3          20.6
##  7 Torgersen Adelie   2007           38.9          17.8
##  8 Torgersen Adelie   2007           39.2          19.6
##  9 Torgersen Adelie   2007           34.1          18.1
## 10 Torgersen Adelie   2007           42            20.2
## # ... with 334 more rows

Selecting columns, part three

The last tidy-selector we’ll mention is where(). where() is a very special tidy selector, that uses logical evaluations to select the data. Let’s have a look at it in action, and see if we can explain it better that way.

Say you are running a correlation analysis. For correlations, you need all the columns in your data to be numeric, as you cannot correlate strings or categories. Going through each individual column and seeing if it is numeric is a bit of a chore. That is where where() comes in!

select(penguins, where(is.numeric))
## # A tibble: 344 x 5
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##             <dbl>         <dbl>             <int>       <int> <int>
##  1           39.1          18.7               181        3750  2007
##  2           39.5          17.4               186        3800  2007
##  3           40.3          18                 195        3250  2007
##  4           NA            NA                  NA          NA  2007
##  5           36.7          19.3               193        3450  2007
##  6           39.3          20.6               190        3650  2007
##  7           38.9          17.8               181        3625  2007
##  8           39.2          19.6               195        4675  2007
##  9           34.1          18.1               193        3475  2007
## 10           42            20.2               190        4250  2007
## # ... with 334 more rows

Magic! Let’s break that down. is.numeric() is a function in R that checks if a vector is numeric. If the vector is numeric, it returns TRUE if not it returns FALSE.

is.numeric(5)
## [1] TRUE
is.numeric("something")
## [1] FALSE

Let us look at the penguins data set again

penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

The penguins data is stored as a tibble, which is a special kind of data set in R that gives a nice print out of the data. Notice, right below the column name, there is some information in <> marks. This tells us the class of the columns. Species and island are factors, while bill columns are “double” which is a decimal numeric class.

where() goes through all the columns and checks if they are numeric, and returns the ones that are. We could try the same for factors!

select(penguins, where(is.factor))
## # A tibble: 344 x 3
##    species island    sex   
##    <fct>   <fct>     <fct> 
##  1 Adelie  Torgersen male  
##  2 Adelie  Torgersen female
##  3 Adelie  Torgersen female
##  4 Adelie  Torgersen <NA>  
##  5 Adelie  Torgersen female
##  6 Adelie  Torgersen male  
##  7 Adelie  Torgersen female
##  8 Adelie  Torgersen male  
##  9 Adelie  Torgersen <NA>  
## 10 Adelie  Torgersen <NA>  
## # ... with 334 more rows

Selecting columns, challenges

Assignment

1a: Select from the penguins data set only columns that are factors.

1b: Now we lost flipper length! To make sure we keep flipper length, instead select columns what end with “mm”.

1c: Now select the columns island, species, and all numeric columns

Solution

## 1a
select(penguins, where(is.factor))
## # A tibble: 344 x 3
##    species island    sex   
##    <fct>   <fct>     <fct> 
##  1 Adelie  Torgersen male  
##  2 Adelie  Torgersen female
##  3 Adelie  Torgersen female
##  4 Adelie  Torgersen <NA>  
##  5 Adelie  Torgersen female
##  6 Adelie  Torgersen male  
##  7 Adelie  Torgersen female
##  8 Adelie  Torgersen male  
##  9 Adelie  Torgersen <NA>  
## 10 Adelie  Torgersen <NA>  
## # ... with 334 more rows
# 1b
select(penguins, where(is.integer))
## # A tibble: 344 x 3
##    flipper_length_mm body_mass_g  year
##                <int>       <int> <int>
##  1               181        3750  2007
##  2               186        3800  2007
##  3               195        3250  2007
##  4                NA          NA  2007
##  5               193        3450  2007
##  6               190        3650  2007
##  7               181        3625  2007
##  8               195        4675  2007
##  9               193        3475  2007
## 10               190        4250  2007
## # ... with 334 more rows
## 1c
select(penguins, island, species, where(is.numeric))
## # A tibble: 344 x 7
##    island    species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>            <dbl>         <dbl>             <int>       <int>
##  1 Torgersen Adelie            39.1          18.7               181        3750
##  2 Torgersen Adelie            39.5          17.4               186        3800
##  3 Torgersen Adelie            40.3          18                 195        3250
##  4 Torgersen Adelie            NA            NA                  NA          NA
##  5 Torgersen Adelie            36.7          19.3               193        3450
##  6 Torgersen Adelie            39.3          20.6               190        3650
##  7 Torgersen Adelie            38.9          17.8               181        3625
##  8 Torgersen Adelie            39.2          19.6               195        4675
##  9 Torgersen Adelie            34.1          18.1               193        3475
## 10 Torgersen Adelie            42            20.2               190        4250
## # ... with 334 more rows, and 1 more variable: year <int>

Filtering rows, part one

Now that we know how to select the columns we want, we should take a look at how we filter the rows. Row filtering is done with the function filter(), which takes statements that can be evaluated to TRUE or FALSE.

What do we mean with statements that can be evaluated to TRUE or FALSE? In the example with where() we used the is.numeric() function to evaluate if the columns where numeric or not. We will be doing the same for rows!

Now, using is.numeric() on a row won’t help, because every row-value in a column will be of the same type, that is how the data set works. All values in a column must be of the same type.

So what can we do? Well, we can check if the values meet certain criteria or not. Like values being above 20, or factors being a specific factor.

filter(penguins, body_mass_g < 3000)
## # A tibble: 9 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Dream            37.5          18.9              179        2975 <NA> 
## 2 Adelie  Biscoe           34.5          18.1              187        2900 fema~
## 3 Adelie  Biscoe           36.5          16.6              181        2850 fema~
## 4 Adelie  Biscoe           36.4          17.1              184        2850 fema~
## 5 Adelie  Dream            33.1          16.1              178        2900 fema~
## 6 Adelie  Biscoe           37.9          18.6              193        2925 fema~
## 7 Adelie  Torge~           38.6          17                188        2900 fema~
## 8 Chinst~ Dream            43.2          16.6              187        2900 fema~
## 9 Chinst~ Dream            46.9          16.6              192        2700 fema~
## # ... with 1 more variable: year <int>

Here, we’ve filtered so that we only have observations where the body mass was less than 3 kilos. We can also filter for specific values, but beware! you must use double equals (==) for comparisons, as single equals (=) are for argument names in functions.

filter(penguins, body_mass_g == 2900)
## # A tibble: 4 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Biscoe           34.5          18.1              187        2900 fema~
## 2 Adelie  Dream            33.1          16.1              178        2900 fema~
## 3 Adelie  Torge~           38.6          17                188        2900 fema~
## 4 Chinst~ Dream            43.2          16.6              187        2900 fema~
## # ... with 1 more variable: year <int>

What is happening, is that R will check if the values in body_mass_g are the same as 2900 (TRUE) or not (FALSE), and will do this for every row in the data set. Then at the end, it will discard all those that are FALSE, and keep those that are TRUE.

Filtering rows, part two

Many times, we will want to have several filters applied at once. What if you only want Adelie penguins that are below 3 kilos? filter() can take as many statements as you want! Combine them by adding commas (,) between each statement, and that will work as ‘and’.

filter(penguins, 
       species == "Chinstrap",
       body_mass_g < 3000)
## # A tibble: 2 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst~ Dream            43.2          16.6              187        2900 fema~
## 2 Chinst~ Dream            46.9          16.6              192        2700 fema~
## # ... with 1 more variable: year <int>

You can also use the & sign, which in R is the comparison character for ‘and’, like == is for ‘equals’.

filter(penguins, 
       species == "Chinstrap" &
         body_mass_g < 3000)
## # A tibble: 2 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Chinst~ Dream            43.2          16.6              187        2900 fema~
## 2 Chinst~ Dream            46.9          16.6              192        2700 fema~
## # ... with 1 more variable: year <int>

Here we are filtering the penguins data set keeping only the species “Chinstrap” and those below 3.5 kilos. And we can keep going!

filter(penguins, 
       species == "Chinstrap",
       body_mass_g < 3000,
       sex == "male")
## # A tibble: 0 x 8
## # ... with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>

But what if we want all the Chinstrap penguins or if body mass is below 3 kilos? When we use the comma (or the &), we make sure that all statements are TRUE. But what if we want it so that either statement is true? Then we can use the or character | .

filter(penguins, 
       species == "Chinstrap" | 
         body_mass_g < 3000)
## # A tibble: 75 x 8
##    species   island    bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
##  1 Adelie    Dream               37.5          18.9              179        2975
##  2 Adelie    Biscoe              34.5          18.1              187        2900
##  3 Adelie    Biscoe              36.5          16.6              181        2850
##  4 Adelie    Biscoe              36.4          17.1              184        2850
##  5 Adelie    Dream               33.1          16.1              178        2900
##  6 Adelie    Biscoe              37.9          18.6              193        2925
##  7 Adelie    Torgersen           38.6          17                188        2900
##  8 Chinstrap Dream               46.5          17.9              192        3500
##  9 Chinstrap Dream               50            19.5              196        3900
## 10 Chinstrap Dream               51.3          19.2              193        3650
## # ... with 65 more rows, and 2 more variables: sex <fct>, year <int>

This now gives us both all chinstrap penguins, and the smallest Adelie penguins! By combining AND and OR statements this way, we can slowly create the filtering we are after.

Filtering rows, challenges

Assignment

2a: Using a comma (‘,’), each expression must be TRUE for the end result. Choose all data where flipper length is larger or equal to 180, and the species is “Gentoo”

2b: Do the same using the & (and) sign.

2c: Filter the penguins data so that you have either chinstrap penguins, or penguins with body mass below or equal to 3 kilos.

Solution

## 2a 
filter(penguins, 
       flipper_length_mm >= 180,
       species == "Gentoo")
## # A tibble: 123 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # ... with 113 more rows, and 2 more variables: sex <fct>, year <int>
## 2b
filter(penguins, 
       flipper_length_mm >= 180 &
         species == "Gentoo")
## # A tibble: 123 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           46.1          13.2               211        4500
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           48.7          14.1               210        4450
##  4 Gentoo  Biscoe           50            15.2               218        5700
##  5 Gentoo  Biscoe           47.6          14.5               215        5400
##  6 Gentoo  Biscoe           46.5          13.5               210        4550
##  7 Gentoo  Biscoe           45.4          14.6               211        4800
##  8 Gentoo  Biscoe           46.7          15.3               219        5200
##  9 Gentoo  Biscoe           43.3          13.4               209        4400
## 10 Gentoo  Biscoe           46.8          15.4               215        5150
## # ... with 113 more rows, and 2 more variables: sex <fct>, year <int>
## 2c
filter(penguins, 
       species == "Chinstrap" | 
         body_mass_g < 3000)
## # A tibble: 75 x 8
##    species   island    bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
##  1 Adelie    Dream               37.5          18.9              179        2975
##  2 Adelie    Biscoe              34.5          18.1              187        2900
##  3 Adelie    Biscoe              36.5          16.6              181        2850
##  4 Adelie    Biscoe              36.4          17.1              184        2850
##  5 Adelie    Dream               33.1          16.1              178        2900
##  6 Adelie    Biscoe              37.9          18.6              193        2925
##  7 Adelie    Torgersen           38.6          17                188        2900
##  8 Chinstrap Dream               46.5          17.9              192        3500
##  9 Chinstrap Dream               50            19.5              196        3900
## 10 Chinstrap Dream               51.3          19.2              193        3650
## # ... with 65 more rows, and 2 more variables: sex <fct>, year <int>

Creating subsetted objects

So far, we have kept working on the penguins data set, without actually altering it. So far, all our actions have been executed, then forgotten by R. Like it never happened. This is actually quite smart, since it makes it harder to do mistakes you can have difficulties changing.

To store the changes, we have to “assign” the data to a new object in the R environment. Like the penguins data set, which already is an object in our environment we have called “penguins”.

We will now store a filtered version including only the chinstrap penguins, in an object we call chinstraps.

chinstraps <- filter(penguins, species == "Chinstrap")

You will likely notice that when we execute this command, nothing is output to the console. That is expected. When we assign the output of a function somewhere, and everything works (i.e., no errors), nothing happens in the console.

But you should be able to see the new chinstraps object in your environment, and when we type chinstraps in the R console, it prints our chinstraps data.

chinstraps
## # A tibble: 68 x 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            46.5          17.9               192        3500
##  2 Chinstrap Dream            50            19.5               196        3900
##  3 Chinstrap Dream            51.3          19.2               193        3650
##  4 Chinstrap Dream            45.4          18.7               188        3525
##  5 Chinstrap Dream            52.7          19.8               197        3725
##  6 Chinstrap Dream            45.2          17.8               198        3950
##  7 Chinstrap Dream            46.1          18.2               178        3250
##  8 Chinstrap Dream            51.3          18.2               197        3750
##  9 Chinstrap Dream            46            18.9               195        4150
## 10 Chinstrap Dream            51.3          19.9               198        3700
## # ... with 58 more rows, and 2 more variables: sex <fct>, year <int>

Maybe in this chinstrap data we are also not interested in the bill measurements, so we want to remove them.

chinstraps <- select(chinstraps, -starts_with("bill"))
chinstraps
## # A tibble: 68 x 6
##    species   island flipper_length_mm body_mass_g sex     year
##    <fct>     <fct>              <int>       <int> <fct>  <int>
##  1 Chinstrap Dream                192        3500 female  2007
##  2 Chinstrap Dream                196        3900 male    2007
##  3 Chinstrap Dream                193        3650 male    2007
##  4 Chinstrap Dream                188        3525 female  2007
##  5 Chinstrap Dream                197        3725 male    2007
##  6 Chinstrap Dream                198        3950 female  2007
##  7 Chinstrap Dream                178        3250 female  2007
##  8 Chinstrap Dream                197        3750 male    2007
##  9 Chinstrap Dream                195        4150 female  2007
## 10 Chinstrap Dream                198        3700 male    2007
## # ... with 58 more rows

Now our data has two less columns, and many fewer rows. A simpler data set for us to work with. But assigning the chinstrap data twice like this is a lot of typing, and there is a simpler way, using something we call the “pipe”.

The pipe %>%

We often want to string together series of functions. This is achieved using pipe operator %>%. This takes the value on the left, and passes it as the first argument to the function call on the right.

%>% is not limited to {dplyr} functions. It’s an alternative way of writing any R code:

The shortcut to insert the pipe operator is Ctrl+Shift+M for Windows/Linux, and Cmd+Shift+M for Mac.

In the chinstraps example, we had the following code to filter the rows and then select our columns.

chinstraps <- filter(penguins, species == "Chinstrap")
chinstraps <- select(chinstraps, -starts_with("bill"))

Here we first create the chinstraps data from the filtered penguins data set. Then use that chinstraps data to reduce the columns and write it again back to the same chinstraps object. It’s a little messy. With the pipe, we can make it more streamlined.

chinstraps <- penguins %>% 
  filter(species == "Chinstrap") %>% 
  select(-starts_with("bill"))

The end result is the same, but there is less typing and we can “read” the pipeline of data subsetting more like language, if we know how. You can read the pipe operator as “and then”.

So if we translate the code above to human language we could read it as:

take the penguins data set, and then keep only rows for the chinstrap penguins, and then remove the columns starting with bill and assign the end result to chinstraps.

Learning to read pipes is a great skill, R is not the only programming language that can do this (though the operator is different between languages, the functionality exists in many).

We can do the entire pipe chain step by step to see what is happening.

penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
penguins %>% 
  filter(species == "Chinstrap")
## # A tibble: 68 x 8
##    species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Chinstrap Dream            46.5          17.9               192        3500
##  2 Chinstrap Dream            50            19.5               196        3900
##  3 Chinstrap Dream            51.3          19.2               193        3650
##  4 Chinstrap Dream            45.4          18.7               188        3525
##  5 Chinstrap Dream            52.7          19.8               197        3725
##  6 Chinstrap Dream            45.2          17.8               198        3950
##  7 Chinstrap Dream            46.1          18.2               178        3250
##  8 Chinstrap Dream            51.3          18.2               197        3750
##  9 Chinstrap Dream            46            18.9               195        4150
## 10 Chinstrap Dream            51.3          19.9               198        3700
## # ... with 58 more rows, and 2 more variables: sex <fct>, year <int>
penguins %>% 
  filter(species == "Chinstrap") %>% 
  select(-starts_with("bill"))
## # A tibble: 68 x 6
##    species   island flipper_length_mm body_mass_g sex     year
##    <fct>     <fct>              <int>       <int> <fct>  <int>
##  1 Chinstrap Dream                192        3500 female  2007
##  2 Chinstrap Dream                196        3900 male    2007
##  3 Chinstrap Dream                193        3650 male    2007
##  4 Chinstrap Dream                188        3525 female  2007
##  5 Chinstrap Dream                197        3725 male    2007
##  6 Chinstrap Dream                198        3950 female  2007
##  7 Chinstrap Dream                178        3250 female  2007
##  8 Chinstrap Dream                197        3750 male    2007
##  9 Chinstrap Dream                195        4150 female  2007
## 10 Chinstrap Dream                198        3700 male    2007
## # ... with 58 more rows

So, for each chain step, the output of the previous step is fed into the next step, and that way the commands build on each other until a final end result is made.

And as before, we still are seeing the output of the command chain in the console, meaning we are not storing it. Let us do that, again using the assignment.

chinstraps <- penguins %>% 
  filter(species == "Chinstrap") %>% 
  select(-starts_with("bill"))

chinstraps
## # A tibble: 68 x 6
##    species   island flipper_length_mm body_mass_g sex     year
##    <fct>     <fct>              <int>       <int> <fct>  <int>
##  1 Chinstrap Dream                192        3500 female  2007
##  2 Chinstrap Dream                196        3900 male    2007
##  3 Chinstrap Dream                193        3650 male    2007
##  4 Chinstrap Dream                188        3525 female  2007
##  5 Chinstrap Dream                197        3725 male    2007
##  6 Chinstrap Dream                198        3950 female  2007
##  7 Chinstrap Dream                178        3250 female  2007
##  8 Chinstrap Dream                197        3750 male    2007
##  9 Chinstrap Dream                195        4150 female  2007
## 10 Chinstrap Dream                198        3700 male    2007
## # ... with 58 more rows

Sorting rows

So far, we have looked at subsetting the data. But some times, we want to reorganize the data without altering it. In tables, we are used to be able to sort columns in ascending or descending order.

This can also be done with {dplyr}’s arrange() function. arrange does not alter the data per se, just the order in which the rows are stored.

penguins %>% 
  arrange(island)
## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

Here we have sorted the data by the island column. Since island is a factor, it will order by the facor levels, which in this case has Biscoe island as the first category. If we sort a numeric column, it will sort by numeric value.

By default, arrange sorts in ascending order. If you want it sorted by descending order, wrap the column name in desc()

penguins %>% 
  arrange(desc(island))
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

Sorting rows, challenges

Assignment

3a: Arrange the penguins data by body mass.

3b: Arrange the penguins data by descending order of flipper length.

3c: You can arrange on multiple columns. Try arranging the data by ascending island and descending flipper length, using a comma between the two arguments.

Solution

## 3a 
penguins %>% 
  arrange(body_mass_g)
## # A tibble: 344 x 8
##    species   island    bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
##    <fct>     <fct>              <dbl>         <dbl>            <int>       <int>
##  1 Chinstrap Dream               46.9          16.6              192        2700
##  2 Adelie    Biscoe              36.5          16.6              181        2850
##  3 Adelie    Biscoe              36.4          17.1              184        2850
##  4 Adelie    Biscoe              34.5          18.1              187        2900
##  5 Adelie    Dream               33.1          16.1              178        2900
##  6 Adelie    Torgersen           38.6          17                188        2900
##  7 Chinstrap Dream               43.2          16.6              187        2900
##  8 Adelie    Biscoe              37.9          18.6              193        2925
##  9 Adelie    Dream               37.5          18.9              179        2975
## 10 Adelie    Dream               37            16.9              185        3000
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
## 3b
penguins %>% 
  arrange(desc(flipper_length_mm))
## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           54.3          15.7               231        5650
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           59.6          17                 230        6050
##  4 Gentoo  Biscoe           49.8          16.8               230        5700
##  5 Gentoo  Biscoe           48.6          16                 230        5800
##  6 Gentoo  Biscoe           52.1          17                 230        5550
##  7 Gentoo  Biscoe           51.5          16.3               230        5500
##  8 Gentoo  Biscoe           55.1          16                 230        5850
##  9 Gentoo  Biscoe           49.5          16.2               229        5800
## 10 Gentoo  Biscoe           49.8          15.9               229        5950
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
## 3c
penguins %>% 
  arrange(island,
          desc(flipper_length_mm))
## # A tibble: 344 x 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Gentoo  Biscoe           54.3          15.7               231        5650
##  2 Gentoo  Biscoe           50            16.3               230        5700
##  3 Gentoo  Biscoe           59.6          17                 230        6050
##  4 Gentoo  Biscoe           49.8          16.8               230        5700
##  5 Gentoo  Biscoe           48.6          16                 230        5800
##  6 Gentoo  Biscoe           52.1          17                 230        5550
##  7 Gentoo  Biscoe           51.5          16.3               230        5500
##  8 Gentoo  Biscoe           55.1          16                 230        5850
##  9 Gentoo  Biscoe           49.5          16.2               229        5800
## 10 Gentoo  Biscoe           49.8          15.9               229        5950
## # ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

Combining with {ggplot}

Some times, we want to subset data just to reduce the data in a plot, not necessarily for our whole pipeline. For instance, maybe you want to plot the data just from the Gentoo species, but don’t want to make another data set.

We can do that with pipes!

penguins %>% 
  filter(species == "Gentoo") %>% 
  ggplot(aes(x = bill_depth_mm,
             y = bill_length_mm,
             colour = species)) +
  geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).

Here, we first filter the penguins data, and then pass that output to ggplot with a pipe. This reduces the data immediately, without storing the intermediate.

Wrap-up

Now we’ve learned about subsetting and sorting our data, so we can create data sets that are suited to our needs. We also learned about chaining commands, the use of the pipe to create a series of commands that build on each other to create a final wanted output.